Threshold Choice Methods: the Missing Link
Many performance metrics have been introduced for the evaluation of
classification performance, with different origins and niches of application:
accuracy, macro-accuracy, area under the ROC curve, the ROC convex hull, the
absolute error, and the Brier score (with its decomposition into refinement and
calibration). One way of understanding the relation among some of these metrics
is the use of variable operating conditions (either in the form of
misclassification costs or class proportions). Thus, a metric may correspond to
some expected loss over a range of operating conditions. One dimension for the
analysis has been precisely the distribution we take for this range of
operating conditions, leading to some important connections in the area of
proper scoring rules. However, we show that there is another dimension which
has not received attention in the analysis of performance metrics. This new
dimension is given by the decision rule, which is typically implemented as a
threshold choice method when using scoring models. In this paper, we explore
many old and new threshold choice methods: fixed, score-uniform, score-driven,
rate-driven and optimal, among others. By calculating the loss of these methods
for a uniform range of operating conditions we get the 0-1 loss, the absolute
error, the Brier score (mean squared error), the AUC and the refinement loss
respectively. This provides a comprehensive view of performance metrics as well
as a systematic approach to loss minimisation, namely: take a model, apply
several threshold choice methods consistent with the information which is (and
will be) available about the operating condition, and compare their expected
losses. In order to assist in this procedure we also derive several connections
between the aforementioned performance metrics, and we highlight the role of
calibration in choosing the threshold choice method.
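The correspondence between threshold choice methods and familiar metrics can be checked numerically. The sketch below assumes one particular convention (not necessarily the paper's exact notation): the operating condition c in [0, 1] is the cost proportion assigned to false positives, the loss at c is twice the cost-weighted error, and the score-driven method sets the threshold equal to c. Under these assumptions, averaging the loss over a uniform range of c recovers the Brier score:

```python
import numpy as np

def score_driven_expected_loss(scores, labels, n_grid=10_001):
    """Expected cost-sensitive loss of the score-driven threshold choice
    method, averaged over a uniform range of operating conditions c.

    Convention (assumed for this sketch): c is the cost proportion of
    false positives, the loss at c is 2 * (c * FP + (1 - c) * FN), and
    the score-driven method predicts positive iff score > c.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    cs = np.linspace(0.0, 1.0, n_grid)
    losses = np.empty(n_grid)
    for i, c in enumerate(cs):
        pred_pos = scores > c                    # score-driven threshold t = c
        fp = np.mean((labels == 0) & pred_pos)   # false-positive mass
        fn = np.mean((labels == 1) & ~pred_pos)  # false-negative mass
        losses[i] = 2.0 * (c * fp + (1.0 - c) * fn)
    return np.mean(losses)                       # uniform average over c

# Under this convention the uniform average matches the Brier score:
scores = np.array([0.9, 0.7, 0.4, 0.2, 0.6])
labels = np.array([1, 1, 0, 0, 1])
brier = np.mean((scores - labels) ** 2)
```

A per-example integration confirms why: for a positive with score s the integral of the weighted loss over c is (1 - s)^2, and for a negative it is s^2, which are exactly the Brier score terms.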
Wind-sensitive Interpolation of Urban Air Pollution Forecasts
People living in urban areas are exposed to outdoor air pollution. Air contamination is linked to numerous premature and prenatal deaths each year. Urban air pollution is estimated to cost approximately 2% of GDP in developed countries and 5% in developing countries. Some works reckon that vehicle emissions produce over 90% of air pollution in cities in these countries. This paper presents some results in predicting and interpolating real-time urban air pollution forecasts for the city of Valencia in Spain. Although many cities provide air quality data, in many cases this information is presented with significant delays (three hours for the city of Valencia) and it is limited to the area where the measurement stations are located. We compare several regression models able to predict the levels of four different pollutants (NO, NO2, SO2, O3) in six different locations of the city. Wind strength and direction are key factors in the propagation of pollutants around the city, so we study different techniques to incorporate this factor into the regression models. Finally, we also analyse how to interpolate forecasts all around the city. Here, we propose an interpolation method that takes wind direction into account. We compare this proposal with well-known interpolation methods. By using these contamination estimates, we are able to generate a real-time pollution map of the city of Valencia.
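One generic way to make spatial interpolation wind-sensitive is to modulate inverse-distance weights by how well each station is aligned with the wind. The sketch below is an illustration of that idea only, not the method from the paper; the `alpha` parameter and the weighting formula are assumptions for the example:

```python
import math

def wind_aware_idw(x, y, stations, wind_dir, alpha=1.0, power=2.0):
    """Inverse-distance-weighted interpolation of pollutant levels,
    up-weighting stations that sit upwind of the query point.

    Generic illustration, not the paper's method. `stations` is a list
    of (sx, sy, value) tuples; `wind_dir` is the direction the wind
    blows towards, in radians; `alpha` (an assumption) controls how
    strongly wind alignment modulates the weights.
    """
    wx, wy = math.cos(wind_dir), math.sin(wind_dir)
    num = den = 0.0
    for sx, sy, value in stations:
        dx, dy = x - sx, y - sy
        dist = math.hypot(dx, dy)
        if dist == 0.0:
            return value                  # query point sits on a station
        # Alignment in [-1, 1]: +1 when the wind carries the station's
        # air directly towards the query point.
        align = (dx * wx + dy * wy) / dist
        w = (1.0 + alpha * (1.0 + align) / 2.0) / dist ** power
        num += w * value
        den += w
    return num / den
```

With the wind blowing from a low-pollution station towards the query point, the estimate is pulled towards that station's value relative to plain IDW.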
Technical Note: Towards ROC Curves in Cost Space
ROC curves and cost curves are two popular ways of visualising classifier
performance, finding appropriate thresholds according to the operating
condition, and deriving useful aggregated measures such as the area under the
ROC curve (AUC) or the area under the optimal cost curve. In this note we
present some new findings and connections between ROC space and cost space, by
using the expected loss over a range of operating conditions. In particular, we
show that ROC curves can be transferred to cost space by means of a very
natural way of understanding how thresholds should be chosen, by selecting the
threshold such that the proportion of positive predictions equals the operating
condition (either in the form of cost proportion or skew). We call these new
curves {ROC Cost Curves}, and we demonstrate that the expected loss as measured
by the area under these curves is linearly related to AUC. This opens up a
series of new possibilities and clarifies the notion of cost curve and its
relation to ROC analysis. In addition, we show that for a classifier that
assigns the scores in an evenly-spaced way, these curves are equal to the Brier
Curves. As a result, this establishes the first clear connection between AUC
and the Brier score.
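The threshold selection described here, choosing t so that the proportion of positive predictions equals the operating condition, amounts to taking a quantile of the scores. A minimal sketch, with ties and finite samples making the match only approximate:

```python
import numpy as np

def rate_driven_threshold(scores, r):
    """Rate-driven threshold choice: pick the threshold so that the
    proportion of positive predictions (score > t) equals the
    operating condition r. The (1 - r)-quantile of the scores makes
    roughly a fraction r of the examples positive; ties and finite
    samples mean the match is only approximate.
    """
    scores = np.asarray(scores, dtype=float)
    return np.quantile(scores, 1.0 - r)

scores = np.linspace(0.0, 1.0, 101)  # toy model with evenly-spaced scores
t = rate_driven_threshold(scores, 0.3)
rate = np.mean(scores > t)           # close to 0.3 positive predictions
```

The evenly-spaced toy scores also match the special case mentioned above, where the resulting curves coincide with the Brier curves.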
CASP-DM: Context Aware Standard Process for Data Mining
We propose an extension of the Cross Industry Standard Process for Data
Mining (CRISP-DM) which addresses specific challenges of machine learning and
data mining for context and model reuse handling. This new general
context-aware process model is mapped onto the CRISP-DM reference model,
proposing some new or enhanced outputs.
Missing the missing values: The ugly duckling of fairness in machine learning
Nowadays, there is an increasing concern in machine learning about the causes underlying unfair decision making, that is, algorithmic decisions discriminating some groups over others, especially groups that are defined over protected attributes such as gender, race and nationality. Missing values are one frequent manifestation of all these latent causes: protected groups are more reluctant to give information that could be used against them, sensitive information for some groups can be erased by human operators, or data acquisition may simply be less complete and systematic for minority groups. However, most recent techniques, libraries and experimental results dealing with fairness in machine learning have simply ignored missing data. In this paper, we present the first comprehensive analysis of the relation between missing values and algorithmic fairness for machine learning: (1) we analyse the sources of missing data and bias, mapping the common causes; (2) we find that rows containing missing values are usually fairer than the rest, which should discourage treating missing values as the uncomfortable ugly data that different techniques and libraries for handling algorithmic bias get rid of at the first occasion; (3) we study the trade-off between performance and fairness when the rows with missing values are used (either because the technique deals with them directly or by imputation methods); and (4) we show that the sensitivity of six different machine-learning techniques to missing values is usually low, which reinforces the view that the rows with missing data contribute more to fairness through the other, non-missing, attributes. We end the paper with a series of recommended procedures about what to do with missing data when aiming for fair decision making.
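The core concern, that discarding rows with missing values can disproportionately remove a protected group, is easy to demonstrate on synthetic data. The missingness mechanism below (group B withholding income far more often) is an assumed illustration of the causes discussed above, not data from the paper:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical synthetic data: group B is more reluctant to disclose
# income, so its rows carry missing values far more often (an assumed
# mechanism mirroring the sources of missingness described above).
n = 1000
group = rng.choice(["A", "B"], size=n, p=[0.7, 0.3])
income = rng.normal(50.0, 10.0, size=n)
miss_prob = np.where(group == "B", 0.5, 0.05)
income[rng.random(n) < miss_prob] = np.nan
df = pd.DataFrame({"group": group, "income": income})

# Listwise deletion (what many fairness pipelines silently do)
# shrinks group B's share of the data:
dropped = df.dropna()
share_b_before = (df["group"] == "B").mean()
share_b_after = (dropped["group"] == "B").mean()

# Simple mean imputation keeps every row, preserving representation:
imputed = df.assign(income=df["income"].fillna(df["income"].mean()))
```

This is only the representation side of the story; the abstract's points (2) to (4) concern how those retained rows then affect fairness and performance.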
An instantiation for sequences of hierarchical distance-based conceptual clustering
In this work, we present an instantiation of our framework for Hierarchical Distance-based Conceptual Clustering (HDCC) using sequences, a particular kind of structured data. HDCC is a general approach to conceptual clustering that extends the traditional algorithm for hierarchical clustering by producing conceptual generalizations of the discovered clusters. We analyze the relationship between distances and generalization operators for sequences in the context of HDCC. Since the approach is general, it allows combining the flexibility of changing distances for different data types while taking advantage of the interpretability offered by the obtained concepts, which is central for descriptive data mining tasks. We propose different generalization operators for sequences and analyze how they work together with the edit and linkage distances in HDCC. This analysis is carried out on the basis of three different properties for generalization operators and three different levels of agreement between the clustering hierarchy obtained from the linkage distance and the hierarchy obtained by using generalization operators.
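The two ingredients combined here, a distance and a generalization operator over sequences, can be sketched concretely. Below, the distance is the standard unit-cost edit (Levenshtein) distance, and the generalization operator is a toy one chosen for illustration (keep the longest common subsequence, mark unmatched stretches with a gap symbol); it is one possible operator, not necessarily those studied in the paper:

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (unit costs)."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution / match
        prev = cur
    return prev[n]

def generalise(a, b, gap="*"):
    """Toy generalization operator for sequences: keep the longest
    common subsequence, collapsing unmatched stretches into a single
    gap symbol. Illustrative only."""
    m, n = len(a), len(b)
    # Longest-common-subsequence dynamic-programming table.
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            L[i + 1][j + 1] = (L[i][j] + 1 if a[i] == b[j]
                               else max(L[i][j + 1], L[i + 1][j]))
    # Backtrack, emitting common elements and collapsing gaps.
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1]); i -= 1; j -= 1
        else:
            if not out or out[-1] != gap:
                out.append(gap)
            if L[i - 1][j] >= L[i][j - 1]:
                i -= 1
            else:
                j -= 1
    if (i > 0 or j > 0) and (not out or out[-1] != gap):
        out.append(gap)
    return "".join(reversed(out))
```

A concept such as `generalise("abcd", "abed")` covers both input sequences, which is the interpretability payoff the abstract emphasises, while `edit_distance` plays the role of the distance the clustering hierarchy is built from.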
Predictable Artificial Intelligence
We introduce the fundamental ideas and challenges of Predictable AI, a
nascent research area that explores the ways in which we can anticipate key
indicators of present and future AI ecosystems. We argue that achieving
predictability is crucial for fostering trust, liability, control, alignment
and safety of AI ecosystems, and thus should be prioritised over performance.
While distinctive from other areas of technical and non-technical AI research,
the questions, hypotheses and challenges relevant to Predictable AI had yet to
be clearly described. This paper aims to elucidate them, calls for identifying
paths towards AI predictability, and outlines the potential impact of this
emergent field.